Abstract. Approximate dictionary matching is a classic string matching
problem (checking if a query string occurs in a collection of strings)
with applications in, e.g., spellchecking, online catalogs, geolocation, and
web searchers. We present a surprisingly simple solution called a split
index, which is based on the Dirichlet principle, for matching a keyword
with few mismatches, and experimentally show that it offers competitive
space-time tradeoffs. Our implementation in the C++ language is
focused mostly on data compaction, which is beneficial for the search
speed (e.g., by being cache friendly). We compare our solution with other
algorithms and we show that it performs better for the Hamming distance.
Query times on the order of 1 microsecond were reported for one
mismatch for the dictionary size of a few megabytes on a medium-end
PC. We also demonstrate that a basic compression technique consisting
in q-gram substitution can significantly reduce the index size (up to 50%
of the input text size for the DNA), while still keeping the query time
relatively low.


1 Introduction 

Dictionary string matching (keyword matching, matching in dictionaries),
defined as the task of checking if a query string occurs in a collection of
strings given beforehand, is a classic research topic. In recent years,
increased interest in approximate dictionary matching can be observed, where
the query and one of the strings from the dictionary may only be similar in a
specified sense rather than equal. Approximate dictionary matching is
considered a hard problem, since most useful string similarity measures are
non-transitive. On the other hand, matching with mismatches (i.e. using a
Hamming distance) is also a very desired functionality with applications in,
among others, bioinformatics [21,22], biometrics [13], cheminformatics [16],
circuit design [19], and web crawling [25].

As indexes supporting approximate matching tend to grow exponentially in
$k$, the maximum number of allowed errors, it is also a worthwhile goal to
design efficient indexes supporting only a small $k$. In this paper, we focus
on the problem of dictionary matching with few mismatches (especially one
mismatch). Formally, for a collection
$\mathcal{D} = \{d_1, \ldots, d_{|\mathcal{D}|}\}$ of $|\mathcal{D}|$ strings
(words) $d_i$ of total length $n$ over a given alphabet $\Sigma$ (where
$\sigma = |\Sigma|$), $I(\mathcal{D})$ is an approximate dictionary index
supporting matching with mismatches, if for any query pattern $P$ it returns
all strings $d_j$ from $\mathcal{D}$ such that $Ham(P, d_j) \le k$ (Hamming
distance). As regards the substrings, they are denoted as $S[i_0, i_1]$ (an
inclusive range), and indexes are 0-based.
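The definition above can be made concrete with a brute-force reference. This is a minimal Python sketch and not the index described in this paper; the function names `hamming` and `brute_force_lookup` and the sample word list are assumptions made for illustration. It returns every $d_j$ with $Ham(P, d_j) \le k$ by scanning the whole dictionary, which is the answer any index $I(\mathcal{D})$ has to reproduce.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def brute_force_lookup(dictionary, pattern, k):
    """Reference answer: every word d_j with Ham(P, d_j) <= k."""
    return [d for d in dictionary
            if len(d) == len(pattern) and hamming(d, pattern) <= k]


# Hypothetical sample dictionary.
words = ["table", "cable", "tablet", "left", "sable"]
print(brute_force_lookup(words, "fable", 1))
```

Words of a different length than the pattern are skipped up front, since the Hamming distance is only defined for strings of equal length.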

2 Related work 

Solutions for approximate dictionary matching can be basically divided into
two classes: worst-case space and query time oriented, and heuristic ones.
Notable results from the first class include the $k$-errata trie by Cole et
al. [11] which is based on the suffix tree and the longest common prefix
structure. It can be used in various contexts, including full-text and
keyword indexing, as well as wildcard matching. For the Hamming distance and
dictionary matching, it uses
$O(n + |\mathcal{D}| \frac{(\log |\mathcal{D}|)^k}{k!})$ space and offers
$O(m + \frac{(\log |\mathcal{D}|)^k}{k!} \log \log n + occ)$ query time (this
also holds for the edit distance but with larger constants). This was
extended by Tsur [29] who described a structure similar to the one from Cole
et al. with time complexity $O(m + \log \log n + occ)$ (for constant $k$) and
$O(n^{1+\varepsilon})$ space for a constant $\varepsilon > 0$. For full-text
searching with the Hamming distance, Gabriele et al. [17] provided an index
with average search time $O(m + occ)$ and $O(n \log^l n)$ space (for some
$l$). Another theoretical work describing the algorithm which is similar to
our split index was given by Shi and Widmayer [28], who obtained $O(n)$
preprocessing time and space complexity and $O(n)$ expected search time if
$k$ is bounded by $O(m / \log m)$. They introduce the notion of home strings
for a given $q$-gram, which is the set of strings in $\mathcal{D}$ that
contain the $q$-gram in the exact form (the value of $q$ is set to
$|P|/(k+1)$). In the search phase, they partition $P$ into $k+1$ disjoint
$q$-grams and use a candidate inspection order to speed up finding the
matches with up to $k$ edit distance errors.

On the practical front, Bocek et al. [3] provided a generalization of the
Mor-Fraenkel [26] algorithm for $k \ge 1$ which is called FastSS. To check if
two strings $S_1$ and $S_2$ match with up to $k$ errors, we first delete all
possible ordered subsets of $k'$ symbols for all $0 \le k' \le k$ from $S_1$
and $S_2$. Then we conclude that $S_1$ and $S_2$ may be in edit distance at
most $k$ if and only if the intersection of the resulting lists of strings is
non-empty (explicit verification is still required). For instance, if
$S_1$ = abbac and $k = 2$, then its neighborhood is as follows: abbac,
bbac, abac, abac, abbc, abba, abb, aba, abc, aba, abc, aac, bba, bbc, bac and
bac (some of the resulting strings are repeated and they may be removed). If
$S_2$ = baxcy, then its respective neighborhood for $k = 2$ will contain,
e.g., the string bac, but the following verification will show that $S_1$ and
$S_2$ are in edit distance greater than 2. If, however,
$Lev(S_1, S_2) \le 2$ (Levenshtein distance), then it is impossible not to
have in the neighborhood of $S_2$ at least one string from the neighborhood
of $S_1$, hence we will never miss a match. The lookup requires
$O(k m^k \log(n m^k))$ time (where $m$ is the average dictionary word length)
and the index occupies $O(n m^k)$ space. Another practical filter was
presented by Karch et al. [20] and it improved on the FastSS method. They
reduced space requirements and the query time by splitting long words
(similarly to FastBlockSS which is a variant of the original method) and
storing the neighborhood implicitly with indexes and pointers to original
dictionary entries. They claimed to be faster than other approaches such as
the aforementioned FastSS and a BK-tree [6]. Recently, Chegrane and
Belazzougui [9] described another practical index and they reported better
results when compared to Karch et al. Their structure is based on the
dictionary by Belazzougui for the edit distance of 1 (see the following
subsection). An approximate (in the mathematical sense) data structure for
approximate matching which is based on the Bloom filter was also
described [24].
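The deletion neighborhood used by FastSS can be sketched in a few lines of Python. This is an illustration under assumptions, not the implementation of Bocek et al.: the names `deletion_neighborhood` and `may_match` are hypothetical, duplicates are removed by using a set, and the explicit verification step that must follow a positive filter answer is left out.

```python
from itertools import combinations


def deletion_neighborhood(s, k):
    """All strings obtained from s by deleting k' symbols, for 0 <= k' <= k."""
    out = set()
    for k2 in range(k + 1):
        for removed in combinations(range(len(s)), k2):
            gone = set(removed)
            out.add("".join(c for i, c in enumerate(s) if i not in gone))
    return out


def may_match(s1, s2, k):
    """Filter only: True means the pair still needs explicit verification."""
    n1 = deletion_neighborhood(s1, k)
    n2 = deletion_neighborhood(s2, k)
    return not n1.isdisjoint(n2)


print(sorted(deletion_neighborhood("abbac", 2)))
print(may_match("abbac", "baxcy", 2))
```

For abbac and $k = 2$ the set holds the 12 distinct strings of the 16 listed in the text, and the pair abbac, baxcy passes the filter through the shared string bac, which is why verification remains necessary.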

A permuterm index is a keyword index which supports queries with one 
wildcard symbol [18]. The idea is to store all rotations of a given word appended
with the terminating character, for instance for the word text, the index would 
consist of the following permuterm vocabulary: text$, ext$t, xt$te, t$tex, 
$text. When it comes to searching, the query is first rotated so that the wildcard 
appears at the end, and subsequently its prefix is searched for using the index. 
This could be for example a trie or any other data structure which supports a 
prefix lookup. The main problem with the standard permuterm index is its space 
usage, as the number of strings inserted into the data structure is the number 
of words multiplied by the average string length. Ferragina and Venturini [15] 
proposed a compressed permuterm index in order to overcome the limitations of 
the original structure with respect to space. They explored the relation between 
the permuterm index and the Burrows-Wheeler Transform [7], which is applied 
to a concatenation of all strings from the input dictionary. They provided a 
modification of the LF-mapping known from FM-indexes [14] in order to support 
the functionality of the permuterm index. 
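The permuterm vocabulary and the query rotation can be illustrated as follows. This is a small Python sketch with hypothetical helper names (`permuterm_rotations`, `rotate_query`); it assumes `$` as the terminating character and `*` as the wildcard, and it uses a linear scan where a real index would use a trie or another structure with prefix lookup.

```python
def permuterm_rotations(word, end="$"):
    """All rotations of the word with the terminating character appended."""
    s = word + end
    return [s[i:] + s[:i] for i in range(len(s))]


def rotate_query(query, wildcard="*", end="$"):
    """Rotate a one-wildcard query so the wildcard comes last; return the prefix."""
    if query.count(wildcard) != 1:
        raise ValueError("exactly one wildcard is expected")
    head, _, tail = query.partition(wildcard)
    return tail + end + head


vocabulary = permuterm_rotations("text")
print(vocabulary)
prefix = rotate_query("te*t")
print(prefix, [r for r in vocabulary if r.startswith(prefix)])
```

Note that a plain prefix match lets the wildcard stand for any run of characters; if it should stand for exactly one symbol, the length of the matched word has to be checked as well.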


2.1 The 1-error problem 

It is important to consider methods for detecting a single error, since over
80% of errors (even up to roughly 95%) are within $k = 1$ for the edit
distance with transpositions [12,27]. Belazzougui and Venturini [2] presented
a compressed index whose space is bounded in terms of the $k$-th order
empirical entropy of the indexed dictionary. It can be based either on
perfect hashing, having $O(m + occ)$ query time or on a compressed permuterm
index with $O(m \min(m, \log_\sigma n \log \log n) + occ)$ time (when
$\sigma = \log^c n$ for some constant $c$) but improved space requirements.
The former is a compressed variant of a dictionary presented by
Belazzougui [1] which is based on neighborhood generation and occupies
$O(n \log \sigma)$ space and can answer queries in $O(m)$ time. Chung et
al. [10] showed a theoretical work where external memory is used, and their
focus is on I/O operations. They limited the number of these operations to
$O(1 + m/(wB) + occ/B)$, where $w$ is the size of the machine word and $B$ is
the number of words within a block (a basic unit of I/O), with the space of
the proposed structure of $O(n/B)$ blocks. In the category of filters, Mor
and Fraenkel [26] described a method which is based on the deletion-only
1-neighborhood.



For the 1-mismatch problem, Yao and Yao [30] described the data structure
for binary strings with fixed length $m$ with $O(m \log \log |\mathcal{D}|)$
query time and $O(|\mathcal{D}| m \log m)$ space requirements. This was later
improved by Brodal and Gąsieniec [4] with a data structure with $O(m)$ query
time which occupies $O(n)$ space. This was in turn extended with a structure
with $O(1)$ query time and $O(|\mathcal{D}| \log m)$ space in a cell probe
model (where only memory accesses are counted) [5]. Another notable example
is a recent theoretical work of Chan and Lewenstein [8], who introduced the
index with optimal query time (i.e. $O(m/w + occ)$, where $occ$ is the number
of pattern occurrences) which uses additional
$O(w d \log^{1+\varepsilon} d)$ bits of space (beyond the dictionary of $d$
strings), assuming a constant-size alphabet.

3 Our algorithm 

The algorithm that we are going to present is uncomplicated and based on the
Dirichlet principle, ubiquitous in approximate string matching techniques. We
partition each word $d$ into $k+1$ disjoint pieces $p_1, \ldots, p_{k+1}$, of
average length $|d|/(k+1)$ (hence the name “split index”), and each such
piece acts as a key in a hash table $H_T$. The size of each piece $p_i$ of
word $d$ is determined using the following formula:
$|p_i| = \lfloor |d|/(k+1) \rceil$ (for $i \le k$) and
$|p_{k+1}| = |d| - \sum_{i=1}^{k} |p_i|$, i.e. the piece size is rounded to
the nearest integer and the last piece covers the characters which are not in
other pieces. This means that the pieces might be in fact unequal in length,
e.g., 3 and 2 for $|d| = 5$ and $k = 1$. The values in $H_T$ are the lists of
words which have one of their pieces as the corresponding key. In this way,
every word occurs on exactly $k+1$ lists. This seemingly bloats the space
usage, still, in the case of small $k$ the occupied space is acceptable.
Moreover, instead of storing full words on the respective lists, we only
store their “missing” prefix or suffix. For instance for the word table and
$k = 1$, we would have a relation tab → le on one list (i.e. tab would be the
key and le would be the value) and le → tab on the other.
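The splitting rule can be sketched in Python as follows. The function name `split_word` is hypothetical, rounding half up is done with integer arithmetic (Python's built-in `round` rounds halves to even, which would give 2 instead of 3 for 5/2), and raising an error for words too short to be split this way is an assumption of this sketch rather than something specified in the text.

```python
def split_word(word, k):
    """Split into k + 1 pieces: the first k have length round(|d| / (k + 1)),
    the last piece takes whatever characters remain."""
    parts = k + 1
    size = (2 * len(word) + parts) // (2 * parts)  # round half up, integers only
    if k * size > len(word) or (k > 0 and size == 0):
        raise ValueError("word too short to be split for this k")
    pieces = [word[i * size:(i + 1) * size] for i in range(k)]
    pieces.append(word[k * size:])
    return pieces


for w in ("table", "tablet", "left"):
    print(w, split_word(w, 1))
print("compression", split_word("compression", 2))
```

For table and $k = 1$ this yields the pieces tab and le, i.e. lengths 3 and 2 as in the example above.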

In the case of $k = 1$, we first populate each list with the pieces without
their prefix and then with the pieces without the suffix; additionally we
store the position on the list (as a 16-bit index) where the latter part
begins. In this way, we traverse only a half of a list on average during the
search. We can also support $k$ larger than 1; in this case, we ignore the
piece order on a list, and we store $\lceil \log_2(k+1) \rceil$ bits with
each piece that indicate which piece of the word (i.e. where is the missing
piece) is the list key. Let us note that this approach would also work for
$k = 1$, however, it turned out to be less efficient.

As regards the implementation, our focus was on data compactness. In the
hash table, we store the buckets which contain word pieces as keys (e.g., le)
and pointers to the lists which store the missing pieces of the word (e.g.,
tab, ft). These pointers are always located right next to the keys, which
means that unless we are very unlucky, a specific pointer should already be
present in the CPU cache during the traversal. The memory layouts of these
substructures are fully contiguous. Successive strings are represented by
multiple characters with a prepended 8-bit counter which specifies the
length, and the counter with the value 0 indicates the end of the list.
During the traversal, each length can be compared with the length of the
piece of the pattern. As mentioned before, the words are partitioned into
pieces of fixed length. This means that on average we calculate the Hamming
distance for only a half of the pieces on the list, since the rest can be
ignored based on their length. Any hash function for strings can be used, and
two important considerations are the speed and the number of collisions,
since a high number of collisions results in longer buckets, which may in
turn have a negative effect on the query time (see Section 4 for further
discussion). Figure 1 illustrates the layout of the split index.
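The contiguous list layout with 8-bit length counters can be mimicked with a byte string. The following Python sketch is only an illustration of the idea: the names `encode_list` and `candidates` are hypothetical, pieces are assumed to be ASCII, and the 16-bit position marker used for $k = 1$ is not modeled.

```python
def encode_list(pieces):
    """Concatenate pieces, each preceded by a one-byte length; a 0 byte ends the list."""
    out = bytearray()
    for p in pieces:
        raw = p.encode("ascii")
        if not 0 < len(raw) < 256:
            raise ValueError("piece length must fit a non-zero 8-bit counter")
        out.append(len(raw))
        out += raw
    out.append(0)  # terminator
    return bytes(out)


def candidates(buf, wanted_len):
    """Yield only pieces whose stored length equals wanted_len; the rest are
    skipped by jumping over them without comparing any characters."""
    pos = 0
    while buf[pos] != 0:
        n = buf[pos]
        if n == wanted_len:
            yield buf[pos + 1:pos + 1 + n].decode("ascii")
        pos += 1 + n


buf = encode_list(["tab", "ft"])  # the list behind the key "le"
print(buf)
print(list(candidates(buf, 2)))
```

Skipping by length is what allows pieces of the wrong size to be discarded before any Hamming distance is calculated.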

The preprocessing stage proceeds as follows: 

1. Duplicate keywords are removed from the dictionary $\mathcal{D}$.

The following steps refer to each word $d_i$ from $\mathcal{D}$.

2. The word $d_i$ is split into $k+1$ pieces.

3. For each piece $p_i$: if $p_i \notin H_T$, we create a new list $L_n$
containing the missing pieces
$\mathcal{P} = \{p_j : j \in [1, k+1] \land j \ne i\}$ and add it to the
hash table (we append $p_i$ and the pointer to $L_n$ to the bucket).
Otherwise, if $p_i \in H_T$, we append the missing pieces $\mathcal{P}$ to
the already existing list $L_i$.

As regards the search: 

1. The pattern $P$ is split into $k+1$ pieces.

2. We search for each piece $p_i$ (the prefix and the suffix if $k = 1$): the
list $L_i$ is retrieved from the hash table or we continue if
$p_i \notin H_T$. Otherwise, we traverse each missing piece $p_j$ from $L_i$.
If $|p_j| = |P| - |p_i|$, the verification is performed and the result is
returned if $Ham(p_j, P - p_i) \le k$ (where the subtraction sign indicates
substring removal).

3. The pieces are combined into one word in order to present the answer.
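The preprocessing and search steps listed above can be sketched together in Python. This is a simplified model under stated assumptions and not the compact C++ layout described earlier: a Python `dict` stands in for the hash table, each list entry stores the number of the key piece alongside the missing pieces (the generic $k \ge 1$ variant), and the names `split_word`, `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def split_word(word, k):
    """First k pieces of rounded length |d| / (k + 1), last piece takes the rest."""
    parts = k + 1
    size = (2 * len(word) + parts) // (2 * parts)
    if k * size > len(word) or (k > 0 and size == 0):
        raise ValueError("word too short to be split for this k")
    return [word[i * size:(i + 1) * size] for i in range(k)] + [word[k * size:]]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def build_index(dictionary, k):
    """Hash table: piece -> list of (number of the key piece, missing pieces)."""
    table = defaultdict(list)
    for word in set(dictionary):              # step 1: remove duplicates
        pieces = split_word(word, k)          # step 2: split into k + 1 pieces
        for i, key in enumerate(pieces):      # step 3: one list entry per piece
            table[key].append((i, pieces[:i] + pieces[i + 1:]))
    return table


def search(table, pattern, k):
    found = set()
    pieces = split_word(pattern, k)
    for i, key in enumerate(pieces):
        rest = "".join(pieces[:i] + pieces[i + 1:])   # P with p_i removed
        for j, missing in table.get(key, ()):
            cand = "".join(missing)
            if j != i or len(cand) != len(rest):      # wrong piece or length
                continue
            if hamming(cand, rest) <= k:              # explicit verification
                found.add("".join(missing[:i] + [key] + missing[i:]))
    return found


index = build_index(["table", "tablet", "left", "cable", "sable", "tabby"], 1)
print(sorted(search(index, "fable", 1)))
```

By the Dirichlet principle, a word within $k$ mismatches of the pattern must agree with it exactly on at least one of the $k+1$ pieces, so looking up every piece of the pattern cannot miss a match.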


3.1 Complexity 

Let us consider the average word length $|d|$, where
$|d| = (\sum_{i=1}^{|\mathcal{D}|} |d_i|)/|\mathcal{D}|$. Average time
complexity of the preprocessing stage is $O(kn)$, where $k$ is the allowed
number of errors, and $n$ is the total input dictionary size (i.e. the length
of the concatenation of all words from $\mathcal{D}$,
$n = \sum_{i=1}^{|\mathcal{D}|} |d_i|$). This is because for each word and
for each piece $p_i$ we can either add the missing pieces to a new list or
append them to the already existing one in $O(|d|)$ time (if optimized; let
us note that $|\mathcal{D}| \cdot |d| = n$). We assume that adding a new
element to the bucket takes constant time on average, and that the
calculation of all hashes takes $O(n)$ time in total. This is true
irrespective of which list layout is used (there are two layouts for $k = 1$
and $k > 1$, see the preceding paragraphs). The occupied space is equal to
$O(kn)$, because each part appears on exactly $k$ lists and in exactly 1
bucket.

The average search complexity is $O(kt)$, where $t$ is the average length of
the list. We search for each of $k+1$ pieces of the pattern of length $m$,
and when the list corresponding to the piece $p_i$ is found, it is traversed
and at most $t$ verifications are performed. Each verification takes at most
$O(\min(m, |d_{max}|))$ time where $d_{max}$ is the longest word in the
dictionary¹, but $O(1)$ time on average. Again, we assume that determining a
location of the specific list, that is iterating a bucket, takes $O(1)$ time
on average. As regards the list, its average length $t$ is higher when there
is a higher probability that two words $d_1$ and $d_2$ from $\mathcal{D}$
have two parts of the same length $l$ which match exactly, i.e.
$\Pr(d_1[i_1, i_1 + l - 1] = d_2[i_2, i_2 + l - 1])$. Since all words are
sampled from the same alphabet $\Sigma$, $t$ depends on the alphabet size,
that is $t = f(\sigma)$. Still, the dependence is rather indirect; in
real-world dictionaries which store words from a given language, $t$ will be
rather dependent on the $k$-th order entropy of the language.

¹ Or $O(k)$ time, in theory, using the old longest common extension (LCE)
based technique from Landau and Vishkin [23], after $O(n \log \sigma)$-time
preprocessing.

Fig. 1. Split index for keyword indexing which shows the insertion of the
word table for $k = 1$. The index also stores the words left and tablet (only
selected lists containing pieces of these two words are shown), and L1 and L2
indicate pointers to the respective lists. The first cell of each list
indicates a 1-based word position (i.e. the word count from the left) where
the missing prefixes begin ($k = 1$, hence we deal with two parts, namely
prefixes and suffixes), and 0 means that the list has only missing suffixes.
Adapted from Wikimedia Commons (author: Jorge Stolfi; available at
http://en.wikipedia.org/wiki/File:Hash_table_3_1_1_0_1_0_0_SP.svg;
CC A-SA 3.0).


3.2 Compression 

In order to reduce storage requirements, we apply a basic compression
technique. We find the most frequent q-grams in the word collection and
replace their occurrences on the lists with unused symbols, e.g., byte values
128,..., 255. The values of q can be specified at the preprocessing stage,
for instance q = 2 and q = 4 are reasonable for the English alphabet and DNA,
respectively. Different q values can be also combined depending on the
distribution of q-grams in the input text, i.e. we may try all possible
combinations of q-grams up to a certain q value and select ones which provide
the best compression. In such a case, longer q-grams should be encoded before
shorter ones. For example, a word compression could be encoded as #p*s\ using
the following substitution list: com → #, re → *, co → $, om → …, sion → \
(note that not all q-grams from the substitution list are used). Possibly
even a recursive approach could be applied, although this would certainly
have a substantial impact on the query time.
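The substitution coding can be illustrated with a short Python sketch. The names `SUBST`, `encode` and `decode` are hypothetical, the table uses four of the substitutions from the example above, and printable symbols stand in for the unused byte values; selecting the most frequent q-grams is not shown.

```python
# Hypothetical substitution table: q-gram -> symbol outside the word alphabet.
SUBST = {"com": "#", "re": "*", "co": "$", "sion": "\\"}


def encode(word, table):
    """Replace q-grams with their symbols, longer q-grams first."""
    for gram in sorted(table, key=len, reverse=True):
        word = word.replace(gram, table[gram])
    return word


def decode(code, table):
    """Expand every symbol back into its q-gram."""
    for gram, symbol in table.items():
        code = code.replace(symbol, gram)
    return code


packed = encode("compression", SUBST)
print(packed, len("compression"), "->", len(packed))
print(decode(packed, SUBST))
```

Encoding longer q-grams first matters here: if co were substituted before com, the word would start with $m instead of #, which takes one more symbol.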

The space usage could be further reduced by the use of a different character 
encoding. For the DNA (assuming 4 symbols only) it would be sufficient to use 
2 bits per character, and for the basic English alphabet 5 bits. In the latter 
case there are 26 letters, which in a simplified text can be augmented only with 
a space character, a few punctuation marks, and a capital letter flag. Such an 
approach would be also beneficial for space compaction, and it could have a 
further positive impact on cache usage. The compression naturally reduces the 
space while increasing the search time, and a sort of a middle ground can be 
achieved by deciding which additional information to store in the index. This 
can be for instance the length of an encoded (compressed) piece after decoding, 
which could eliminate some pieces based on their size without performing the 
decompression and explicit verification. 
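The 2-bit encoding for DNA mentioned above can be sketched as follows. This Python illustration assumes the 4-symbol alphabet ACGT, a little-endian placement of symbols inside each byte, and the hypothetical function names `pack` and `unpack`; the original length has to be stored separately, since the last byte may be only partly used.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTER = "ACGT"


def pack(seq):
    """Store four symbols per byte, two bits each."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` symbols from the packed bytes."""
    return "".join(LETTER[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))


kmer = "ACGTTGCAACGTTGCAACGT"  # a made-up 20-mer
packed = pack(kmer)
print(len(kmer), "characters ->", len(packed), "bytes")
```

A 20-character piece shrinks to 5 bytes, a fourfold reduction relative to one byte per character.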


4 Experimental results 

Experimental results were obtained on a machine equipped with the Intel
i5-3230M processor running at 2.6 GHz and 8 GB DDR3 memory, and the C++
code was compiled with clang version 3.4-1 and run on the Ubuntu 14.04 OS.

One of the crucial components of the split index is a hash function. Ideally,
we would like to minimize the average length of the bucket (let us recall
that we use chaining for collision resolution), however, the hash function
should be also relatively fast because it has to be calculated for each of
the $k+1$ parts of the pattern (of total length $m$). We investigated various
hash functions, and it turned out that the differences in query times are not
negligible, although the average length of the bucket was almost the same in
all cases (relative differences were smaller than 1%). We can see in Table 1
that the fastest function was the xxhash (available on the Internet under the
following link: https://code.google.com/p/xxhash/), and for this reason it
was used for the calculation of other results.


Hash function    Query time (μs)
xxhash           0.93
sdbm             0.95
FNV1             0.95
FNV1a            0.95
SuperFast        0.96
Murmur3          0.97
City             0.99
FARSH            1.00
SpookyV2         1.04
Farm             1.04

Table 1. Evaluated hash functions and search times per query for the English
dictionary of size 2.67 MB and $k = 1$. A list of common English misspellings
was used as queries, max LF = 2.0.
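To make the role of the hash function concrete, here is a Python sketch of one of the evaluated functions, the 32-bit FNV-1a, together with a bucket selection step. The constants are the standard FNV offset basis and prime; the helper `bucket_of` and the bucket count are assumptions for illustration and do not reflect how the buckets were sized in the experiments.

```python
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def bucket_of(piece: str, n_buckets: int) -> int:
    """Hypothetical bucket selection for a word piece used as a key."""
    return fnv1a_32(piece.encode("utf-8")) % n_buckets


for piece in ("tab", "le", "ft"):
    print(piece, hex(fnv1a_32(piece.encode("utf-8"))), bucket_of(piece, 1024))
```

Pieces that land in the same bucket are resolved by chaining, so the hash only needs to be computed once per piece of the pattern.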


Decreasing the value of the load factor (LF) did not strictly provide a speedup 
in terms of the query time, as demonstrated in Figure 2. This can be explained 
by the fact that even though the relative reduction in the number of collisions 
was substantial, the absolute difference was equal to at most a few collisions per 
list. Moreover, when the LF was higher, pointers to the lists could be possibly 
closer to each other, which might have had a positive effect on cache utilization. 
The best query time was reported for the maximum LF value of 2.0, hence this 
value was used for the calculation of other results. 

In Table 2 we can see a linear increase in the index size and an exponential
increase in query time with growing $k$. Even though we concentrate on
$k = 1$ and the most promising results are reported for this case, our index
might remain competitive also for higher $k$ values.


k    Query time (μs)    Index size (KB)
1    0.51               1,715
2    11.49              2,248
3    62.85              3,078

Table 2. Query time and index size vs the error value $k$ for the English
language dictionary of size 0.79 MB. A list of common English misspellings
was used as queries.




[Plot for Fig. 2; x-axis: load factor.]

Fig. 2. Query time and index size vs the load factor for the English
dictionary of size 2.67 MB and $k = 1$. A list of common English misspellings
was used as queries. The value of LF can be higher than 1.0 because we use
chaining for collision resolution.


Q-gram substitution coding provided a reduction in the index size, at the
cost of increased query time. Q-grams were generated separately for each
dictionary $\mathcal{D}$ as a list of 100 q-grams which provided the best
compression for $\mathcal{D}$, i.e. they minimized the size of all encoded
words, $S_e = \sum_{i=1}^{|\mathcal{D}|} |Enc(d_i)|$. For the English
language dictionaries, we also considered using only 2-grams or only 3-grams,
and for the DNA only 2-grams (a maximum of 25 2-grams) and 4-grams, since
mixing the q-grams of various sizes has a further negative impact on the
query time. For the DNA, the queries were generated randomly by introducing
noise into words sampled from the dictionary, and their length was equal to
the length of the particular word. Up to 3 errors were inserted, each with a
50% probability. For the English dictionaries we opted for the list of common
misspellings, and the results were similar to the case of randomly generated
queries.
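One possible reading of the query generation procedure for the DNA is sketched below in Python. The details are assumptions rather than the authors' procedure: the function name `noisy_query` is hypothetical, the alphabet is taken to be ACGTN, each of the 3 potential errors is a substitution at a uniformly chosen position, and positions may repeat, so the resulting Hamming distance is at most 3.

```python
import random


def noisy_query(word, rng, alphabet="ACGTN", max_errors=3, p=0.5):
    """Copy the word and substitute up to max_errors symbols, each with probability p."""
    chars = list(word)
    for _ in range(max_errors):
        if rng.random() < p:
            pos = rng.randrange(len(chars))
            chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


rng = random.Random(2024)  # fixed seed so that a run can be repeated
source = "ACGTTGCAACGTTGCAACGT"  # a made-up 20-mer
queries = [noisy_query(source, rng) for _ in range(5)]
print(queries)
```

The query keeps the length of the sampled word, which matches the requirement that the Hamming distance is only defined for strings of equal length.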

We can see the speed-to-space relation for the English dictionaries in
Figure 3 and for the DNA in Figure 4. In the case of English, using the
optimal (from the compression point of view, i.e. minimizing the index size)
combination of mixed q-grams provided almost the same index size as using
only 2-grams. Substitution coding methods performed better for the DNA (where
$\sigma = 5$) because the sequences are more repetitive. Let us note that the
compression provided a higher relative decrease in index size with respect to
the original text as the size of the dictionary increased. For instance, for
the dictionary of size 627.8 MB the compression ratio was equal to 1.93 and
the query time was still around 100 μs.




[Plot for Fig. 3; x-axis: dictionary size (KB).]

Fig. 3. Query time and index size vs dictionary size for $k = 1$, with and
without q-gram coding. Mixed q-grams refer to the combination of q-grams
which provided the best compression, and for the three dictionaries these
were equal to ([2-, 3-, 4-] grams): [88, 8, 4], [96, 2, 2], and [94, 4, 2],
respectively. English language dictionaries and the list of common English
misspellings were used.


On the English language dictionaries, our index gave promising results when
compared to methods proposed by other authors. Others consider the
Levenshtein distance as the edit distance, whereas we use the Hamming
distance, which puts us at an advantageous position. Still, the provided
speedup is significant, and we believe that the more restrictive Hamming
distance is also an important measure of practical use. The implementations
of other authors are available on the Internet
(http://searchivarius.org/personal/software;
https://code.google.com/p/compact-approximate-string-dictionary/, from
Boytsov and Chegrane and Belazzougui, respectively). As regards the results
reported for the MF method and Boytsov’s Reduced alphabet neighborhood
generation, it was not possible to accurately calculate the size of the index
(both implementations by Boytsov), and for this reason we used rough ratios
based on index sizes reported by Boytsov for similar dictionary sizes. Let us
note that we compare our algorithm with Chegrane and Belazzougui, who report
better results when compared to Karch et al., who in turn claimed to be
faster than other state-of-the-art methods [9,20]. We have not managed to
identify any practice-oriented indexes for matching in dictionaries over any
fixed alphabet $\Sigma$ dedicated to the Hamming distance, which could be
directly compared to our split index. The times for the brute-force algorithm
are not listed, since they were roughly 3 orders of magnitude higher than the
ones presented. Consult Figure 5 for details.

Fig. 4. Query time and index size vs dictionary size for $k = 1$, with and
without q-gram coding. Mixed q-grams refer to the combination of q-grams
which provided the best compression, and these were equal to ([2-, 3-, 4-]
grams): [16, 66, 18] (due to computational constraints, they were calculated
only for the first dictionary, but used for all four dictionaries). DNA
dictionaries and the randomly generated queries were used.

We also evaluated different word splitting schemes. For instance for $k = 1$,
one could split the word into two parts of different sizes, e.g., 6 → (2, 4)
instead of 6 → (3, 3), however, unequal splitting methods caused slower
queries when compared to the regular one. As regards Hamming distance
calculation, it turned out that a naive implementation (i.e. simply iterating
and comparing each character) was the fastest one. The compiler with
automatic optimization was simply more efficient than other implementations
(e.g., ones based directly on SSE instructions) that we have investigated.











[Plot for Fig. 5; x-axis: index size (KB).]

Fig. 5. Query time vs index size for different methods. The method with
compression encoded mixed q-grams. We used the Hamming distance, and the
other authors used the Levenshtein distance for $k = 1$. English language
dictionaries of size 0.79 MB, 2.67 MB, and 5.8 MB were used as input, and the
list of common misspellings was used for queries.


5 Conclusions 

We have presented an index for dictionary matching with mismatches, which 
performed best for the Hamming distance of one. Its functionality could be 
extended by storing additional information in the lists that contain the missing 
parts of the words. This could be for instance a mapping of words to positions 
in the document, which would create an inverted index supporting approximate 
matching. 

The algorithm can be sped up by means of parallelization, since access to
the index during the search procedure is read-only. In the most
straightforward approach we could simply distribute individual words between
multiple threads. A more fine-grained variation would be to concurrently
operate on parts of the word after it has been split up (the number of parts
depending on the $k$ parameter), or we could even access in parallel lists
which contain candidate prefixes and suffixes. If we had a sufficient number
of threads at our disposal, these approaches could be combined.
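The most straightforward variant, distributing whole query words between threads, can be sketched in Python as follows. This only illustrates the sharing pattern and is not a performance claim: the brute-force `lookup` is a stand-in for a split index query, the names `lookup` and `parallel_lookup` are hypothetical, and in CPython the global interpreter lock limits the speedup that threads can give for such CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(words, pattern, k):
    """Stand-in for a split index query: a read-only brute-force scan."""
    return [w for w in words
            if len(w) == len(pattern)
            and sum(a != b for a, b in zip(w, pattern)) <= k]


def parallel_lookup(words, patterns, k, workers=4):
    """Distribute whole query words between threads; `words` is never modified."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: lookup(words, p, k), patterns))


words = ("table", "cable", "left", "tablet")  # immutable, shared by all threads
results = parallel_lookup(words, ["fable", "lift", "tablet"], 1)
print(results)
```

Because the shared structure is only read, no locking is needed, and the results come back in the order of the submitted queries.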











Appendix A 

The following data sets were used in order to obtain the experimental results: 

- iamerican: 0.79 MB, English, available from Linux packages

- foster: 2.67 MB, English, available at:
http://www.math.sjsu.edu/~foster/dictionary.txt

- iamerican-insane: 5.8 MB, English, available from Linux packages

- DNA: 20-mers extracted from the genome of Drosophila melanogaster
(available at: http://flybase.org/), sizes: 6.01 MB, 135.89 MB, 262.78 MB,
and 627.80 MB

- A list of common English misspellings: 44.2 KB (4,261 words), available at:
http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines